ABSTRACT


This report describes the design and implementation of a systolic
convolution processor. It uses a number of bit-serial processors,
each operating on 8-bit signed coefficients and 8-bit positive pixel
intensities. The processing time is independent of the coefficient
array size due to the architecture selected. Overflow detection and
output scaling capabilities are provided for autonomous applications.
Two processors were placed on a CMOS custom integrated circuit using
5 µm design rules. Simulation results were used to estimate the
processing time to be under 0.5 s for a 512 x 512 pixel image.


RESUME


This report describes the design and realization of a systolic
processor specialized for convolution. Owing to the architecture
used, the computation time is independent of the number of
coefficients. The systolic processors are bit-serial and use 8-bit
signed coefficients and 8-bit positive pixels. Some capability to
detect computation errors is provided for autonomous applications.
The processor was implemented as a CMOS integrated circuit following
5 µm design rules. Simulations were used to estimate that a
512 x 512 pixel image could be processed in less than 0.5 s.
1.0 INTRODUCTION

Over  the  past  years,  Defence  Research  Establishment  Valcartier 
(DREV)  has  been  involved  in  the  automatic  detection  and  tracking  of 
targets  in  imagery.  These  efforts  have  resulted  in  the  development  of 
many  algorithms,  some  of  which  use  digital  convolution  to  implement 
linear  and  certain  non-linear  filters  (e.g.  Sobel  edge  detector). 

The two-dimensional convolution may be described by eq. 1, where
y(m,n) is the output image, x(m,n) the input image, and h(m,n) is the
system's impulse response. In practice, the convolution is always
summable because (1) the image and the impulse response are defined
only in a limited range of k and l; and (2) the image intensity and
the impulse response coefficients are bounded.


$$y(m,n) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} x(k,l)\, h(m-k,\, n-l) \qquad [1]$$

Throughout  this  report,  the  terms  impulse  response,  coefficient 
array,  and  convolution  kernel  will  be  used  interchangeably. 

Because  of  its  importance  in  image  processing  applications  and 
its  heavy  computational  demand,  many  people  have  implemented  digital 
convolution  in  hardware.  Among  such  implementations,  it  is  worth 
mentioning  the  convolution  computer  described  by  Heuft  and  Little 
(Ref.  1),  which  uses  a  number  of  identical  processors  in  parallel,  each 
working on a different point of the output image. Other
implementations include the ones by Comtal Corp. (Ref. 2) and Quantex
Corp. (Ref. 3), which were real-time versions limited to 3 x 3
coefficient arrays. More recently, systolic implementations have been
proposed by Kung et al. (Refs. 4-6), McCanny et al. (Refs. 7-9),
Danielson (Ref. 10) and many more.

A survey of image processing techniques indicated the need to
rapidly process large convolution kernels, typically up to 15 x 15.
For compatibility with current digitizers, a convolver should be able
to handle images with up to 8 bits/pixel and varying image size. Such
a processor should be as compact as possible and some capabilities for
autonomous operation should be provided.

This report describes the design and implementation of a systolic
convolution processor that operates bit-serially and provides
overflow detection circuitry, which enables it to be used in
unsupervised applications. The design was targeted for Very Large
Scale Integrated (VLSI) circuit implementation in order to maximize
the convolution array size while minimizing the total area
requirements of the convolver. The circuit was manufactured by
Northern Telecom through the Canadian Microelectronics Corporation
(CMC).

The main part of this work was performed while the author was on
educational leave at McGill University, and was completed at DREV
between March 1985 and May 1986 under PCN 32D18 Real-Time Image
Filters.

2.0  ARCHITECTURE 


This chapter introduces the basic concepts used and describes
the architecture selected for this convolution processor. Section 2.1
introduces systolic arrays. Section 2.2 describes the advantages
and disadvantages of bit-serial communications and processing. Section
2.3 deals with the architecture selected for the individual processors,
while Section 2.4 describes how these processors may be interconnected
to perform the desired two-dimensional convolution.

2.1 Systolic Arrays

The architecture used in this design is based on the principle
of a systolic array as introduced by Kung and Leiserson (Refs. 4, 11
and 12). This approach uses a number of identical processors organized
in a regular array. A global (synchronous) clock "pumps" data through
the array of processors in the same way as the heart pumps blood
through the arteries of the body. This approach has numerous
advantages, including:

1)  The  reduction  of  the  input/output  requirements  through  the 
use  of  the  same  data  many  times; 

2)  The  ease  of  expansion  through  modularity  and  regular  flow  of 
data  and  control  signals; 

3)  The  ease  and  speed  of  the  design  and  implementation  cycles 
because  of  the  repetitive  use  of  simple  cells;  and 

4)  The  use  of  local  communications,  eliminating  the  need  for 
global  interconnections. 

There  are  also  a  few  disadvantages  to  systolic  arrays,  including 
the  large  number  of  I/O  signals  they  usually  require,  as  well  as  their 
lack  of  programmability  in  most  cases. 

The basic structure of a systolic processor is a regular array
of identical processors, each of which may communicate only with its
nearest neighbors. As an example, let us investigate the computation
of a one-dimensional convolution (eq. 2a) using a systolic processor,
and let us limit the impulse response of the system to only 3
coefficients (eq. 2b).

$$y(m) = \sum_{k=-\infty}^{\infty} x(k)\, h(m-k) = \sum_{k=-\infty}^{\infty} h(k)\, x(m-k) \qquad [2a]$$

$$y(m) = h(0)\, x(m) + h(1)\, x(m-1) + h(2)\, x(m-2) \qquad [2b]$$

Figure 1(a) shows a linear convolver described by Kung and
Picard (Ref. 5). Each processor is used to perform part of the
computation for a single output point. Figure 1(b) shows the
functionality of an individual processor. In particular, the processor
stores the value of the coefficient in an internal register, computes
the product of h(k) and x(l), and adds this product to the Yin input.
The Xout output is simply a delayed version of the input signal Xin,
necessary for the proper synchronization of the processors.

FIGURE 1(a) - Systolic linear convolver


Xout(t) = Xin(t-2)

Yout(t) = Yin(t-1) + h Xin(t-1)

FIGURE 1(b) - Systolic processing element

Figure 2 shows an example of the operation of such a convolver
through a number of clock cycles. The example shows a three-point
linear convolution with coefficients a, b and c, on clock cycles 5
through 9. The desired result is of the form of eq. 2b, where
h(0) = c, h(1) = b and h(2) = a.


(a) 5th clock cycle

(b) 6th clock cycle:  cx(6);  cx(5) + bx(4)

(c) 7th clock cycle:  cx(7);  cx(6) + bx(5);  cx(5) + bx(4) + ax(3)

(d) 8th clock cycle:  cx(8);  cx(7) + bx(6);  cx(6) + bx(5) + ax(4)

(e) 9th clock cycle


FIGURE 2 - Example of the operation of a 3-coefficient linear
convolution processor

The inputs x(n) are entered on every cycle. Each processor
computes part of the convolution for every output point. In
particular, the first processor computes cx(5) on the 5th cycle; the
second processor computes bx(4) and adds it to the result from the
first processor, resulting in the partial sum cx(5) + bx(4) on the 6th
cycle; and the last processor computes the missing element ax(3),
completing the computation of the convolution point with the desired
result on the 7th cycle: cx(5) + bx(4) + ax(3) = y(5).
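
The data flow described above can be checked with a short behavioural
simulation. The sketch below is only an illustration: it models each
processing element with the two relations given with Fig. 1(b),
Xout(t) = Xin(t-2) and Yout(t) = Yin(t-1) + h Xin(t-1), and the names
systolic_convolve and direct_convolve are hypothetical. It assumes
that coeffs[i] is the coefficient stored in the i-th element, counting
from the element that receives the input stream, and that the
pipeline starts with all registers cleared.

```python
def systolic_convolve(x, coeffs):
    """Behavioural model of a chain of systolic processing elements."""
    n = len(coeffs)
    x_d1 = [0] * n   # Xin delayed by one cycle, per element
    x_d2 = [0] * n   # Xin delayed by two cycles, per element (this is Xout)
    y_reg = [0] * n  # registered partial sum, per element (this is Yout)
    outputs = []
    # Feed the samples, then n zeros to flush the pipeline.
    for sample in list(x) + [0] * n:
        outputs.append(y_reg[-1])      # Yout of the last element at this cycle
        # Inputs seen by each element during this cycle.
        x_in = [sample] + x_d2[:-1]
        y_in = [0] + y_reg[:-1]
        # All registers update together on the clock edge.
        y_reg = [y_in[i] + coeffs[i] * x_in[i] for i in range(n)]
        x_d2 = x_d1
        x_d1 = x_in
    return outputs[n:]                 # drop the n-cycle latency of this model


def direct_convolve(x, coeffs):
    """Reference: y(m) = sum over i of coeffs[i] * x(m - i), with x(m) = 0 for m < 0."""
    return [sum(c * x[m - i] for i, c in enumerate(coeffs) if m - i >= 0)
            for m in range(len(x))]


samples = [1, 2, 3, 4, 5]
kernel = [3, 2, 1]
print(systolic_convolve(samples, kernel))
print(direct_convolve(samples, kernel))
```

In this model the x stream moves at half the speed of the partial
sums, which is what lets each partial sum meet successively older
samples as it travels down the chain.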

2.2  Bit-Serial  Communications 


Bit-serial computing consists of processing the data one bit at
a time. This method is often used in highly parallel architectures
such as the MPP (Ref. 13), the CLIP4 (Refs. 13-14) and many more. Its
advantages include:

-  Reduced  number  of  I/O  signals, 

-  Smaller  processor  size  for  higher  density, 

-  Simpler  elements  for  faster  design  cycle,  and 

-  Less  propagation  delay  for  a  shorter  clock  cycle. 

However, these advantages are offset by the following
drawbacks:

-  Slower  operation,  and 

-  More  complex  control  structure. 

Even  though  parallel  computations  are  certainly  much  faster,  it 
is  interesting  to  note  that  bit-serial  computations  will  execute  at  a 
faster  clock  rate,  but  require  more  cycles  to  complete  the  operations. 

Through simulations, it was estimated that the bit-serial
processor would meet the design requirement of less than 1 s to
process a 512 x 512 pixel image. It was then decided to use this
approach in order to minimize the size of the processors and hence
maximize the size of the coefficient array that could be handled in a
specific area.

2.3 Processor Architecture


The basic architecture of this convolver is similar to the one
described in Section 2.1, but with a number of modifications to
accommodate our choices of using bit-serial communications and
processing, as well as our need to include some form of scaling of the
multiplier output and overflow detection. This section describes the
processor architecture adopted for this project.

2.3.1  Data  Representation 

In such a design, it is important to determine the accuracy and
resolution that will be required at the different steps of the
processing. In the case of convolution, the output is often an image
with the same gray scale resolution as the input image; however, it is
important to compute the intermediate results with a greater dynamic
range in order to reduce the rounding errors to a minimum.

Let us now investigate the number of bits and the data formats
required to represent the data. First, the input pixels should be
represented in a format compatible with today's digitizers. Certainly
the most widely accepted format is to represent pixels as 8-bit,
unsigned numbers ranging in value from 0 to 255. For synchronization
purposes, this 8-bit value is stored in the least significant portion
of a 16-bit word.

On the other hand, representing the coefficients as positive
values would not be satisfactory, since it would restrict the
flexibility of the processor. It was determined that an 8-bit two's
complement representation (values ranging from -128 to +127) would
allow sufficient range for most applications.

Finally,  one  must  represent  the  partial  convolution  sum  (Y) 
data.  To  simplify  the  control  structure  of  the  processors,  it  was 
decided  to  assign  it  the  same  format  as  the  output  of  the  multiplier, 
i.e.  16-bit,  two's  complement  (values  ranging  from  -32768  to  +32767). 
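
The ranges quoted above can be checked with a few lines of
arithmetic. The sketch below is only an illustration (the helper
fits_signed and the constant names are hypothetical): it confirms that
every product of an 8-bit two's complement coefficient and an 8-bit
unsigned pixel fits in the 16-bit two's complement format, and that
the sum of as few as two extreme products may not, which is why some
form of overflow handling is of interest.

```python
def fits_signed(value, bits):
    """True if value is representable as a two's complement number of the given width."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


PIXEL_MAX = 255                    # 8-bit unsigned pixel, 0..255
COEFF_MIN, COEFF_MAX = -128, 127   # 8-bit two's complement coefficient

product_min = COEFF_MIN * PIXEL_MAX   # most negative single product
product_max = COEFF_MAX * PIXEL_MAX   # most positive single product

print(product_min, product_max)
print(fits_signed(product_min, 16), fits_signed(product_max, 16))
# Two extreme products already exceed the 16-bit range when added.
print(fits_signed(2 * product_max, 16))
```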

2.3.2  Processor  Block  Diagram 

The structure of the processor is shown in Fig. 3, and a short
description of the signals is given in Table I. The processor contains
a serial-parallel multiplier used to compute the product of the
coefficient with the input pixel intensity, and a serial adder circuit
for the partial convolution sum input and the output of the
multiplier. It also contains an 8-bit shift register used to store the
coefficient, a 32-bit (2 words) shift register to synchronize the
pixel intensity data, and a 16-bit shift register (1 word) to
synchronize the flow of the partial convolution sum data.

Two additional blocks have been included for autonomous
operation. One is the overflow detection circuit used to flag
overflows and underflows at the output of the serial adder circuit.
The OVout flag is generated by a logical OR of the output of this
circuit with the OVin signal; this flag indicates the occurrence of an
error while computing the current output point Yout.




TABLE I


Description of processor signals


Symbol       Name                 Description
-----------  -------------------  ----------------------------------------
Xin          Pixel Input          Input gray level, 16-bit positive value
                                  (0 <= Xin <= 255).
Cin          Coefficient Input    Input coefficient, 8-bit two's
                                  complement value.
Yin          Partial Sum Input    Input partially computed convolution
                                  sum, 16-bit two's complement number.
OVin         Overflow Input       Input flag indicating whether an
                                  overflow occurred while computing Yin.
Xout         Pixel Output         Output gray level, 16-bit value
                                  (0 <= Xin <= 255).
Cout         Coefficient Output   Output coefficient, 8-bit two's
                                  complement number.
Yout         Partial Sum Output   Output partial convolution sum,
                                  16-bit two's complement number.
OVout        Overflow Output      Output overflow flag indicating if an
                                  overflow occurred during the
                                  computation of Yout.
Ld           Load Coefficient     When active, the coefficient shift
                                  register is activated and the
                                  coefficient is loaded through Cin.
Pe           Processor Enable     When active, the whole processor is
                                  activated and the value of Yout
                                  computed.
ResetA       Adder Reset          When active, the adder circuit is reset.
ResetM       Multiplier Reset     When active, the multiplier is reset.
Sd           Sum Disable          When active, the output of the
                                  multiplier is replaced with the sign
                                  of the coefficient. This is used for
                                  dynamic range extension.
Φ1 and Φ2    Clocks               Two-phase, non-overlapping clock.

FIGURE  3  -  Block  diagram  of  processor  architecture 

The second circuit is a sum-disable circuit used to scale the
output of the multiplier by a factor of 2^-n, where n is a positive
integer representing the number of cycles during which the signal Sd
is active. This circuit is useful to scale down the output of the
multiplier in order to reduce occurrences of overflows. In a typical
system, all of the processors would share a common Sd signal.

2.3.3  Processor  Operation 

In  order  to  simplify  the  control  structure,  all  the  processors 
are  synchronized  on  word  boundaries,  each  operating  on  the  same  bit  at 
the  same  time.  The  correct  operation  of  the  circuit  requires  16  clock 
cycles.  Figure  4  shows  a  typical  operating  waveform. 

The  pixel  intensities  and  the  partial  convolution  sums  are 
entered  in  the  processor  one  bit  at  a  time,  starting  with  the  least 
significant  bit.  On  the  last  cycle,  the  most  significant  bit  is 
entered  to  complete  the  processing.  It  is  important  to  note  that  while 
bit  3  is  being  entered  in  the  processor,  bit  3  of  another  word  is  also 
available  at  the  output  so  that  the  processors  may  easily  be  chained  to 
perform  convolutions  of  arbitrary  lengths. 

2.4  Convolver  System 

The structure of a 1-D convolver using this processor is very
similar to the one described in Section 2.1. Figure 5 shows such a
convolver based on the processor described in Section 2.3. The
overflow signals are connected to form a shift register in order for
the OVout flag to always represent the validity of the Yout result.

Based on the processor described in the previous section, it is
now possible to design a two-dimensional convolver. The convolution
sum of eq. 1 may be separated into a sum of 1-D convolutions. Hence, a
2-D convolution may be performed using a structure as shown in Fig. 6.

FIGURE 6 - Two-dimensional convolver


One important characteristic of systolic arrays is their
expansibility, which is due to their basic regular structure. The
proposed structure maintains this characteristic. It enables the user
to employ the number of processors required for the particular
application, using one processor per filter coefficient. The dimension
of the image may also be changed without affecting the array of
processors, although the delay structure would have to be modified
accordingly.
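
The separation of eq. 1 into a sum of 1-D convolutions, on which this
section relies, can be illustrated numerically. The sketch below is
not a model of the hardware or of its delay structure; it only checks,
on small lists, that convolving each image row with each kernel row
and accumulating the results into the proper output row gives the same
answer as the direct double sum. The function names are hypothetical
and the full (zero-padded) output size is an assumption of the sketch.

```python
def conv1d(x, h):
    """Full 1-D convolution of two lists."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


def conv2d_direct(x, h):
    """Direct evaluation of y(m,n) = sum over k,l of x(k,l) h(m-k, n-l)."""
    rows = len(x) + len(h) - 1
    cols = len(x[0]) + len(h[0]) - 1
    y = [[0] * cols for _ in range(rows)]
    for k, x_row in enumerate(x):
        for l, xv in enumerate(x_row):
            for i, h_row in enumerate(h):
                for j, hv in enumerate(h_row):
                    y[k + i][l + j] += xv * hv
    return y


def conv2d_by_rows(x, h):
    """Same result, written as a sum of 1-D row convolutions."""
    rows = len(x) + len(h) - 1
    cols = len(x[0]) + len(h[0]) - 1
    y = [[0] * cols for _ in range(rows)]
    for k, x_row in enumerate(x):
        for i, h_row in enumerate(h):
            # Image row k convolved with kernel row i contributes to output row k + i.
            for n, value in enumerate(conv1d(x_row, h_row)):
                y[k + i][n] += value
    return y


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(conv2d_by_rows(image, kernel) == conv2d_direct(image, kernel))
```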

3.0  DESIGN 


This  chapter  introduces  some  of  the  CMOS  design  concepts  and 
describes  the  main  building  blocks  used  in  this  design,  including  a 
serial-parallel  multiplier,  an  adder,  a  shift  register  and  some  other 
basic  circuits.  This  chapter  also  gives  a  description  of  how  these 
blocks  were  assembled  to  form  an  integrated  circuit. 

3.1  CMOS  Gates 


This design was implemented through an agreement between
Northern Telecom and the Canadian Microelectronics Corporation. Two
technologies were available: a 5 µm nMOS and a 5 µm CMOS process.
Because Northern Telecom was phasing out its nMOS process, it was
decided to use CMOS for this design. The new 3 µm process was not
available when this circuit was designed.

There are many ways to implement CMOS circuits. One approach is
to use static gates, which are very similar to normal digital gates in
that they are not clocked. Figure 7(a) shows a static NAND. The
advantages of a static implementation include the relatively low power
consumption and the small number of transistors switching
simultaneously. However, the disadvantage is certainly the large area
required for static CMOS gates, since the logic must be duplicated for
both the n-type and the p-type transistors. To avoid this duplication,
dynamic CMOS gates are often used. An example of a dynamic NAND is
shown in Fig. 7(b). Notice that certain transistors are controlled by
a pre-charge clock (Φ). Dynamic gates use the gate capacitance of the
following transistors to store a pre-charge value, which may then be
modified (charged or discharged) during the evaluation phase. The use
of dynamic gates often permits more compact designs, but there are a
number of problems associated with their use. In particular, race
conditions may often yield inaccurate results. Also, the large flow of
charges during the pre-charge phase may induce a condition known as
latch-up, which prevents proper operation of the circuit (Ref. 15).

Many approaches or design techniques have been devised to reduce these
problems. One method is domino logic (Ref. 16), in which dynamic
gates are always followed by a static inverter. This method eliminates
the race problem because the pre-charge phase always disables all the
n-type transistors. An example of a domino AND gate is shown in Fig.
7(c). However, this method precludes the use of inverting logic.
Another method, proposed by Goncalves and De Man (Ref. 17), uses
alternating n-logic and p-logic gates, so that the pre-charge of one
gate always disables the logic transistors in the next gate. A number
of rules are described in that reference to ensure that race
conditions do not occur during pre-charge. This method has been termed
NoRa, standing for "No Race". An example using three NAND gates is
shown in Fig. 7(d).


(a)  Static  NAND 


FIGURE  7  -  CMOS  logic  gates 

It was decided to use static CMOS gates for this project because
they are relatively easy to use and are less likely to induce a
substrate latch-up. Also, because most of the gates needed are
relatively simple, the use of static gates is not expected to
significantly increase the layout requirements.

3.2  Common  Cells 


There  are  three  basic  cells  that  are  used  throughout  this 
design.  They  are  a  full-adder  and  two  types  of  flip-flops,  with  and 
without  preset.  This  section  will  describe  these  three  cells  and 
assess  their  performance  based  on  circuit  simulations  performed  using 
SPICE,  an  electrical  circuit  simulator  developed  at  the  University  of 
California,  Berkeley. 

3.2.1  Full-Adder 


A  full-adder  is  a  subcircuit  that  is  used  to  encode  three  bits 
into  a  sum  and  carry  format.  The  truth  table  of  this  function  is  shown 
in  Table  II,  while  the  logic  equations  (eqs.  3,  4)  are  shown  below: 


$$K = AB + AC + BC \qquad [3]$$

$$S = ABC + \overline{K}\,(A + B + C) \qquad [4]$$


FIGURE  8  -  Full-adder  circuit  diagram 

In Table II and eq. 4, A, B and C are the inputs, S is the sum
and K is the output carry. The advantage of the formulation shown is
that no inverted signals are necessary except for K. A transistor
diagram of a CMOS circuit implementing this function is shown in
Fig. 8. One should notice that K is generated as an active-low signal
(the complement of K) so that it may be used directly to generate S.

The circuit was laid out in 5 µm CMOS and required 211 x 193 µm².

The worst case propagation delay through this circuit occurs when a
change in the input signal triggers a change in the carry which, in
turn, forces a change in the sum output; for example, when the input
switches from ABC = 100 (K=0, S=1) to ABC = 110 (K=1, S=0). This
situation was simulated using SPICE. The maximum propagation delay was
found to be approximately 15 ns.
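
The full-adder equations can be verified exhaustively. The sketch
below is only a logic-level check, not a model of the transistor
circuit of Fig. 8; it reads eq. 4 with the inverted (active-low)
carry, as described in the text, and confirms that S and K together
encode the arithmetic sum of the three input bits. The function name
full_adder is hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (S, K) for input bits a, b, c using eqs. 3 and 4."""
    k = (a & b) | (a & c) | (b & c)          # eq. 3: carry
    k_bar = 1 - k                            # active-low carry
    s = (a & b & c) | (k_bar & (a | b | c))  # eq. 4: sum, using the inverted carry
    return s, k


# Check all eight input combinations against ordinary addition.
for a, b, c in product((0, 1), repeat=3):
    s, k = full_adder(a, b, c)
    assert a + b + c == s + 2 * k

# The worst-case transition quoted in the text: ABC = 100 to ABC = 110.
print(full_adder(1, 0, 0), full_adder(1, 1, 0))
```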

3.2.2  Flip-Flop  Without  Preset 

The next basic cell required was a flip-flop. This cell was
required to store the coefficients and also to form the shift
registers. In order to reduce the chances of race conditions,
master-slave circuits were designed. The circuit diagram is shown in
Fig. 9.

There are two similar subcircuits, the master and the slave.
The first one stores the data when Φ1 is active; otherwise the data is
recirculated through the feedback. The slave is controlled by Φ2 and
operates in a similar way. In order to operate this flip-flop
correctly, Φ1 and Φ2 are pulsed sequentially. It is important to note
that if the clocks do overlap, a race condition may occur, causing the
latch to become transparent.

FIGURE 9 - Circuit diagram of flip-flop without preset

The layout of this circuit required 365 x 178 µm². Simulation
results indicate that the flip-flop hold time is approximately 12 ns.
This was computed from the time the input is valid until the time the
output of the feedback is stabilized. The output is valid 7 ns after
the leading edge of the slave clock (Φ2).

3.2.3  Flip-Flop  With  Preset 

The design of the multiplier (Section 3.3) required flip-flops
that could be preset. The circuit diagram is shown in Fig. 10. The
basic structure and operation are similar to the other flip-flop,
except for the presence of a preset signal in the master section.

The layout required 375 x 204 µm². Simulation results indicated
a preset hold time of 4 ns. The flip-flop hold time is approximately
12 ns and the output is valid 7 ns after the leading edge of the slave
clock (Φ2).

3.3  Serial-Parallel  Multiplier 

The  next  circuit  that  must  be  designed  is  the  multiplier.  The 
serial-parallel  structure  was  selected  to  minimize  computation  time 
while  maintaining  a  minimum  number  of  I/O  signals.  This  structure  was 
possible  because  the  coefficients  are  rarely  modified  and  may  easily  be 
stored  in  a  parallel  form  on  chip. 

Let us now investigate how this multiplier may be implemented.
First, the coefficient is a two's complement number (Section 2.3.1)
which may be represented by eq. 5, where c(i) represents the i-th bit
of the coefficient C.

$$C = -c(7)\, 2^{7} + \sum_{i=0}^{6} c(i)\, 2^{i} \qquad [5]$$

The product may then be described by eq. 6, in which X' is the
two's complement (negative) of X.

$$CX = -X\, c(7)\, 2^{7} + \Big( \sum_{i=0}^{6} c(i)\, 2^{i} \Big) X \qquad [6a]$$

$$CX = c(7)\, X'\, 2^{7} + \Big( \sum_{i=0}^{6} c(i)\, 2^{i} \Big) X \qquad [6b]$$

A hardware structure to implement this product is shown in
Fig. 11. It consists of seven identical blocks, with an eighth similar
but modified to properly handle the sign bit (perform the two's
complement of the X input). Each block computes one of the c(i) 2^i X
products. A control signal (Ctl) is used to identify the first cycle
and perform part of the computation of the two's complement of X.
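
The decomposition of eqs. 5 and 6 can be checked numerically over the
whole operand range. The sketch below is an arithmetic check only and
does not model the bit-serial timing of the multiplier; the function
names are hypothetical, and X' is taken to be the negative of X.

```python
def coeff_bits(c):
    """Bits c(0)..c(7) of an 8-bit two's complement coefficient."""
    return [((c & 0xFF) >> i) & 1 for i in range(8)]


def value_from_bits(bits):
    """Eq. 5: C = -c(7) 2^7 + sum of c(i) 2^i for i = 0..6."""
    return -bits[7] * 2**7 + sum(bits[i] * 2**i for i in range(7))


def product_from_bits(bits, x):
    """Eq. 6b: CX = c(7) X' 2^7 + (sum of c(i) 2^i for i = 0..6) X, with X' = -X."""
    x_neg = -x
    return bits[7] * x_neg * 2**7 + sum(bits[i] * 2**i for i in range(7)) * x


# Exhaustive check over all coefficients and pixel values.
for c in range(-128, 128):
    bits = coeff_bits(c)
    assert value_from_bits(bits) == c
    for x in range(256):
        assert product_from_bits(bits, x) == c * x

print(coeff_bits(-3), product_from_bits(coeff_bits(-3), 200))
```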

The actual implementation of the multiplier was accomplished
using the basic cells described in Section 3.2. Two cells were
necessary, one for the main block of the multiplier and one to handle
differences between the sign cell and the normal cell. The design of
the main cell is shown in Fig. 12 and requires 962 x 454 µm², while
the sign cell is shown in Fig. 13 and is 959 x 192 µm². Figure 14
shows how these cells are combined to form the desired multiplier.

Results from a simulation using SPICE allowed us to estimate the
worst multiplier cell delay at 32 ns. Because of the intrinsic
pipeline used in the multiplier, this value also corresponds to the
multiplier worst delay during a single cycle. Hence, an 8 x 8
multiplication (16 cycles) can be performed in 512 (= 16 x 32) ns.

FIGURE 12 - Main cell circuit of serial-parallel multiplier

FIGURE 13 - Sign circuit of serial-parallel multiplier

FIGURE 14 - Block diagram of multiplier layout

3.4 Shift Registers

The shift registers were simple to design using the master-slave
flip-flops described in Section 3.2. However, in order to save real
estate on the chip, another layout was performed using the same
circuit diagram as shown in Fig. 9. The basic cell could then be
duplicated to form long shift registers. Figure 15 shows a 16-bit
version, which required 704 x 1327 µm².

3.5  Sum-Disable 

The circuit used for sum-disabling is shown in Fig. 16. Sign
extension is done by replacing the multiplier output bits with the
coefficient sign. This is valid because the input is always positive
and hence, the multiplier output sign is always the same as that of the
coefficient.


FIGURE 15 - Block diagram of 16-bit shift register

FIGURE 16 - Circuit diagram of sum-disable circuit


3.6  Adder 


A serial adder is used to sum the multiplier output with the
partial sum input. This circuit is shown in Fig. 17. It incorporates
a full-adder, a flip-flop for the carry, and a flip-flop for the input
(for synchronization with the multiplier output). The actual layout
also incorporates the sum-disable circuit as well as a flip-flop used
by the overflow circuit. The cell requires 963 x 465 µm².

3.7  Overflow 


When two n-bit numbers are added, an overflow or underflow is
detected when the carry from the (n-1)th addition is different from
the carry from the nth addition. The overflow circuit diagram is shown
in Fig. 18. The layout of this circuit requires 938 x 490 µm², and the
circuit delay was estimated at 17 ns.
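
The overflow rule stated above can be illustrated with a behavioural
model of a bit-serial addition. The sketch below is an assumption-laden
illustration rather than a description of the circuits of Figs. 17
and 18: it processes two 16-bit two's complement words one bit per
cycle, least significant bit first, keeps a single carry between
cycles, and flags an overflow when the carry into the most significant
position differs from the carry out of it. All names are hypothetical.

```python
def to_bits_lsb_first(value, n=16):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits_lsb_first(bits):
    """Signed value of a two's complement word given LSB first."""
    unsigned = sum(bit << i for i, bit in enumerate(bits))
    return unsigned - (1 << len(bits)) if bits[-1] else unsigned


def serial_add(a, b, n=16):
    """Add two n-bit words one bit per cycle; return (result, overflow flag)."""
    carry = 0             # models the carry flip-flop, cleared before each word
    carry_into_msb = 0
    out = []
    for i, (abit, bbit) in enumerate(zip(to_bits_lsb_first(a, n),
                                         to_bits_lsb_first(b, n))):
        if i == n - 1:
            carry_into_msb = carry      # carry from the (n-1)th addition
        total = abit + bbit + carry
        out.append(total & 1)           # sum bit
        carry = total >> 1              # carry out of this position
    overflow = carry_into_msb != carry  # compare with carry from the nth addition
    return from_bits_lsb_first(out), overflow


print(serial_add(1000, 2000))
print(serial_add(32767, 1))
print(serial_add(-32768, -1))
```

When the flag is set, the returned word has wrapped around, which is
the situation the overflow output is meant to signal.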

3.8  Clock  Drivers 

Because  of  the  large  number  of  transistor  gates  that  must  be 
driven  from  the  clocks,  special  drivers  were  designed.  They  are  used 
to  produce  the  control  signals  (and  their  inverse)  to  enable  the 
different  flip-flops. 


FIGURE 17 - Circuit diagram of serial adder


FIGURE 18 - Circuit diagram of overflow detection

3.9 Processor


All the cells were designed and laid out to minimize the number
of connections required when assembling the processor. Figure 19 shows
the basic layout of a single processor. The size of the processor is
1752 x 4682 µm².

3.10  Chip 

Finally, the circuit layout was completed at the chip level with
the inclusion of two processors on a single chip. The layout is shown
in block diagram form in Fig. 20. On the diagram, the small boxes on
the periphery are the pad drivers, which provide the interface to the
outside world. The boxes marked (Φ1Pe, etc.) generate the control
signals required by the processors, which are represented by the large
boxes.

The chip dimensions are 4700 x 4700 µm². A total of 19 pads
were used: 2 for power, 12 input, and 5 output. A total of 3380
transistors are used in this design. The worst case propagation delay
was estimated at 92 ns, for a maximum clock frequency of 10.9 MHz. The
power requirement for this circuit is estimated to be 70 mW at its
maximum operating frequency.
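
The timing figures quoted in this chapter are related by simple
arithmetic, reproduced below as a check. The variable names are
hypothetical; the numeric inputs are the simulated delays stated in
the text.

```python
# Worst-case propagation delay of the complete chip, from simulation.
worst_case_delay_ns = 92.0
# The clock period cannot be shorter than the worst-case delay.
max_clock_mhz = 1e3 / worst_case_delay_ns

# Multiplier alone (Section 3.3): 16 cycles at the worst cell delay.
multiplier_cell_delay_ns = 32.0
cycles_per_multiplication = 16
multiply_time_ns = cycles_per_multiplication * multiplier_cell_delay_ns

print(round(max_clock_mhz, 1), "MHz")
print(multiply_time_ns, "ns")
```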

FIGURE 19 - Block diagram of processor layout


FIGURE 20 - Block diagram of chip layout

4.0 CONCLUSION


Digital convolution is certainly among the most commonly used
processing functions in image processing. However, it is
computationally expensive; a total of (M x N) x (I x J)
multiplications and (M x N) x (I x J - 1) additions are required to
process an M x N image by an I x J coefficient array.
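
To give a sense of scale, the counts above can be evaluated for the
image and kernel sizes mentioned in this report (a 512 x 512 image and
a kernel of up to 15 x 15). The function name below is hypothetical.

```python
def convolution_op_counts(M, N, I, J):
    """Multiplications and additions for an M x N image and an I x J kernel."""
    multiplications = (M * N) * (I * J)
    additions = (M * N) * (I * J - 1)
    return multiplications, additions


print(convolution_op_counts(512, 512, 15, 15))
print(convolution_op_counts(512, 512, 3, 3))
```

For the 15 x 15 case this is close to 59 million multiplications and
as many additions per image.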

This report described the design of a VLSI systolic convolution
circuit which was implemented in a 5 µm CMOS process. It uses a
pipelining technique to divide the computational task among a number
of identical, synchronous processors. The time required to perform
digital convolution is proportional to the size of the image, and
independent of the size of the coefficient array, because the system
uses one processor per coefficient.

Bit-serial  communications  and  processing  were  used  throughout 
this  design  to  reduce  the  processor  size  (maximize  density)  as  well  as 
the  number  of  I/O  pins,  thus  allowing  the  use  of  smaller  packages.  A 
total  of  19  pins  were  used. 

The coefficients were represented as 8-bit, two's complement
numbers while the pixel values were restricted to a range from 0 to
255. This was done to allow for easy interfacing with standard
digitizers as well as a reasonable range of coefficients.

Overflow detection and multiplier scaling capabilities were
included in the design so that this processor can be used for
autonomous applications. Moreover, with these capabilities, the user
can carry out simple or sophisticated error recovery.

Two systolic processors were put on a single integrated circuit
of 5 mm x 5 mm with all interconnections done internally. The circuit
simulations show that it should be able to operate at up to 10 MHz,
allowing it to process a 512 x 512 pixel image in approximately 0.35 s.
Power consumption was estimated to be 25 mW when operating at 4 MHz.
More details about this design may be found in Ref. 18.